Name: Kiran Shrestha
Title: Data for health policy
Link to the page: here
The project primarily investigates data on health factors for each county in the USA. Health factors here include health behaviors, clinical care, socio-economic factors, the physical environment, and other health outcomes. Using the available data along with additional public datasets, I plan to find out which variables are most responsible for health outcomes. Metrics such as correlation coefficients can help rank those variables. Using the selected variables, I plan to create a model and possibly test it with new data sources.
I plan to first find more datasets that I can relate to this one, and thus have more explanatory variables that could influence the health outcomes. Demographics, education quality, or the presence or absence of certain institutions could shed more light on the health results. GitHub will primarily be used to store all the data and notebooks.
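One way to relate another county-level dataset to this one is to join on the 5-digit FIPS code. The sketch below is only an illustration under assumptions: both frames are tiny and made up, and the `extra` frame with its `median_age` column is a hypothetical stand-in for whatever public dataset gets added later.

```python
import pandas as pd

# Toy stand-in for the health data (values are made up).
health = pd.DataFrame({
    "fips": ["01001", "01003", "56001"],
    "adult_obesity": [0.38, 0.31, 0.27],
})

# Hypothetical extra dataset keyed by the same 5-digit FIPS code.
extra = pd.DataFrame({
    "fips": ["01001", "56001", "99999"],
    "median_age": [38.6, 27.9, 40.0],
})

# Keep FIPS codes as zero-padded strings so "01001" and "1001" do not diverge.
# how="left" keeps every county from the health data; indicator=True adds a
# "_merge" column recording whether a match was found.
merged = health.merge(extra, on="fips", how="left", indicator=True)

# Counties with no match in the extra dataset are flagged as "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "fips"].tolist()
print(merged)
print("Unmatched counties:", unmatched)
```

Checking the unmatched list after each merge is a cheap way to notice key mismatches before they turn into silent missing values.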
pip install missingno
Collecting missingno Downloading missingno-0.5.2-py3-none-any.whl (8.7 kB) Requirement already satisfied: numpy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.24.4) Requirement already satisfied: matplotlib in /opt/conda/lib/python3.11/site-packages (from missingno) (3.7.2) Requirement already satisfied: scipy in /opt/conda/lib/python3.11/site-packages (from missingno) (1.11.2) Requirement already satisfied: seaborn in /opt/conda/lib/python3.11/site-packages (from missingno) (0.12.2) Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.1.0) Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (4.42.1) Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (1.4.5) Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (23.1) Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (10.0.0) Requirement already satisfied: pyparsing<3.1,>=2.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (3.0.9) Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.11/site-packages (from matplotlib->missingno) (2.8.2) Requirement already satisfied: pandas>=0.25 in /opt/conda/lib/python3.11/site-packages (from seaborn->missingno) (2.0.3) Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3) Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=0.25->seaborn->missingno) (2023.3) Requirement already satisfied: six>=1.5 in 
/opt/conda/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.16.0) Installing collected packages: missingno Successfully installed missingno-0.5.2 Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# import pycountry_convert as pc
import missingno as mno
import warnings
URL = "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2023_0.csv"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df = pd.read_csv(URL, storage_options=headers)
# with warnings.catch_warnings():
#     warnings.simplefilter('ignore')
#     df = pd.read_csv("data/analytic_data2023_0.csv")
df.head()
| | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Release Year | County Ranked (Yes=1/No=0) | Premature Death raw value | Premature Death numerator | Premature Death denominator | ... | % Female raw value | % Female numerator | % Female denominator | % Female CI low | % Female CI high | % Rural raw value | % Rural numerator | % Rural denominator | % Rural CI low | % Rural CI high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statecode | countycode | fipscode | state | county | year | county_ranked | v001_rawvalue | v001_numerator | v001_denominator | ... | v057_rawvalue | v057_numerator | v057_denominator | v057_cilow | v057_cihigh | v058_rawvalue | v058_numerator | v058_denominator | v058_cilow | v058_cihigh |
| 1 | 00 | 000 | 00000 | US | United States | 2023 | NaN | 7281.9355638 | 4125218 | 917267406 | ... | 0.5047067187 | 167509003 | 331893745 | NaN | NaN | 0.193 | NaN | NaN | NaN | NaN |
| 2 | 01 | 000 | 01000 | AL | Alabama | 2023 | NaN | 10350.071456 | 88086 | 13668498 | ... | 0.5142542169 | 2591778 | 5039877 | NaN | NaN | 0.409631829 | 1957932 | 4779736 | NaN | NaN |
| 3 | 01 | 001 | 01001 | AL | Autauga County | 2023 | 1 | 8027.3947267 | 836 | 156081 | ... | 0.513782892 | 30362 | 59095 | NaN | NaN | 0.4200216232 | 22921 | 54571 | NaN | NaN |
| 4 | 01 | 003 | 01003 | AL | Baldwin County | 2023 | 1 | 8118.3582061 | 3377 | 614143 | ... | 0.5134771453 | 122872 | 239294 | NaN | NaN | 0.4227909911 | 77060 | 182265 | NaN | NaN |
5 rows × 720 columns
df.shape
(3195, 720)
# Display the first 10 column names
for col in df.columns[:10]:
    print(col)
State FIPS Code
County FIPS Code
5-digit FIPS Code
State Abbreviation
Name
Release Year
County Ranked (Yes=1/No=0)
Premature Death raw value
Premature Death numerator
Premature Death denominator
This page describes the idea behind the dataset. This link has all the datasets from different years to download. This page lists the further sources that were used to compile the final data. The dataset has 700+ features to work with, although many columns overlap in meaning and some have missing data.
Primarily, the data columns can be divided into health factors and health outcomes.
# This plot shows the missing data
# The longer the bar, the less data is missing
mno.bar(df)
<Axes: >
# Drop every column with more than 1000 missing values
for col in df.columns:
    if df[col].isnull().sum() > 1000:
        df.drop([col], axis=1, inplace=True)
# Columns go from 720 to 326
df.shape
(3195, 326)
mno.bar(df)
<Axes: >
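The same column filter can be written without a loop. This is a minimal sketch on a toy frame, assuming the same rule as the cell above (drop a column when its missing count exceeds a threshold); the name `max_missing` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" has 3 missing values, "c" has 1, "a" has none.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

max_missing = 2  # hypothetical threshold; the notebook cell uses 1000

# Count missing values per column, then keep columns at or below the threshold.
missing_counts = toy.isna().sum()
kept = toy.loc[:, missing_counts <= max_missing]

print(missing_counts.to_dict())
print(list(kept.columns))
```

Building a new frame instead of dropping in place also avoids changing `df` while looping over its columns.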
Many columns carry repetitive meaning (raw value, numerator, denominator, and confidence bounds for the same measure). So, we keep only the ones that are enough to represent each measurement.
# We need the raw values only
new_cols = [x for x in df.columns if "raw value" in x]
new_cols = list(df.columns[0:5]) + new_cols
# Replace % by percent
cols = list(map(lambda x:x.replace("%", "percent"), new_cols))
# Remove certain char and substring
cols = list(map(lambda x:x.replace("-", " "), cols))
cols = list(map(lambda x:x.replace(" raw value", ""), cols))
cols = list(map(lambda x:x.replace(" ", "_"), cols))
cols = list(map(lambda x:x.replace(" ", ""), cols))
cols
['State_FIPS_Code', 'County_FIPS_Code', '5_digit_FIPS_Code', 'State_Abbreviation', 'Name', 'Premature_Death', 'Poor_or_Fair_Health', 'Poor_Physical_Health_Days', 'Poor_Mental_Health_Days', 'Low_Birthweight', 'Adult_Smoking', 'Adult_Obesity', 'Food_Environment_Index', 'Physical_Inactivity', 'Access_to_Exercise_Opportunities', 'Excessive_Drinking', 'Alcohol_Impaired_Driving_Deaths', 'Sexually_Transmitted_Infections', 'Teen_Births', 'Uninsured', 'Primary_Care_Physicians', 'Dentists', 'Mental_Health_Providers', 'Preventable_Hospital_Stays', 'Mammography_Screening', 'Flu_Vaccinations', 'High_School_Completion', 'Some_College', 'Unemployment', 'Children_in_Poverty', 'Income_Inequality', 'Children_in_Single_Parent_Households', 'Social_Associations', 'Injury_Deaths', 'Air_Pollution___Particulate_Matter', 'Drinking_Water_Violations', 'Severe_Housing_Problems', 'Driving_Alone_to_Work', 'Long_Commute___Driving_Alone', 'Life_Expectancy', 'Premature_Age_Adjusted_Mortality', 'Frequent_Physical_Distress', 'Frequent_Mental_Distress', 'Diabetes_Prevalence', 'HIV_Prevalence', 'Food_Insecurity', 'Limited_Access_to_Healthy_Foods', 'Insufficient_Sleep', 'Uninsured_Adults', 'Uninsured_Children', 'Other_Primary_Care_Providers', 'High_School_Graduation', 'Reading_Scores', 'Math_Scores', 'School_Segregation', 'School_Funding_Adequacy', 'Gender_Pay_Gap', 'Median_Household_Income', 'Children_Eligible_for_Free_or_Reduced_Price_Lunch', 'Child_Care_Cost_Burden', 'Child_Care_Centers', 'Suicides', 'Firearm_Fatalities', 'Motor_Vehicle_Crash_Deaths', 'Voter_Turnout', 'Census_Participation', 'Traffic_Volume', 'Homeownership', 'Severe_Housing_Cost_Burden', 'Broadband_Access', 'Population', 'percent_Below_18_Years_of_Age', 'percent_65_and_Older', 'percent_Non_Hispanic_Black', 'percent_American_Indian_or_Alaska_Native', 'percent_Asian', 'percent_Native_Hawaiian_or_Other_Pacific_Islander', 'percent_Hispanic', 'percent_Non_Hispanic_White', 'percent_Not_Proficient_in_English', 'percent_Female', 
'percent_Rural']
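The chain of `replace` calls above could also be folded into one helper. The sketch below is an alternative, not the notebook's method: `clean_column_name` is a hypothetical name, the sample headers are assumed from the output above, and it collapses runs of separators into a single underscore, so it yields `Air_Pollution_Particulate_Matter` rather than the triple underscore shown in the list.

```python
import re

def clean_column_name(name: str) -> str:
    """Turn a raw header such as '% Rural raw value' into 'percent_Rural'."""
    name = name.replace("%", "percent")
    name = name.replace(" raw value", "")
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Remove any underscore left at the start or end.
    return name.strip("_")

print(clean_column_name("% Rural raw value"))
print(clean_column_name("Air Pollution - Particulate Matter raw value"))
print(clean_column_name("5-digit FIPS Code"))
```

If the existing column names are already referenced elsewhere, the original naming should be kept, since the two schemes differ for headers that contain a hyphen surrounded by spaces.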
# Slice the dataframe
df = df[new_cols]
# Rename the columns
df = df.rename(columns=dict(zip(new_cols, cols)))
df.head(2)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statecode | countycode | fipscode | state | county | v001_rawvalue | v002_rawvalue | v036_rawvalue | v042_rawvalue | v037_rawvalue | ... | v053_rawvalue | v054_rawvalue | v055_rawvalue | v081_rawvalue | v080_rawvalue | v056_rawvalue | v126_rawvalue | v059_rawvalue | v057_rawvalue | v058_rawvalue |
| 1 | 00 | 000 | 00000 | US | United States | 7281.9355638 | 0.12 | 3 | 4.4 | 0.0819065527 | ... | 0.1682705801 | 0.1261202919 | 0.0131594526 | 0.0613162595 | 0.0026003593 | 0.1887563262 | 0.5930615866 | 0.0410440385 | 0.5047067187 | 0.193 |
2 rows × 82 columns
# remove the first row
df = df.drop([0])
df = df.reset_index(drop=True)
df.head(2)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00 | 000 | 00000 | US | United States | 7281.9355638 | 0.12 | 3 | 4.4 | 0.0819065527 | ... | 0.1682705801 | 0.1261202919 | 0.0131594526 | 0.0613162595 | 0.0026003593 | 0.1887563262 | 0.5930615866 | 0.0410440385 | 0.5047067187 | 0.193 |
| 1 | 01 | 000 | 01000 | AL | Alabama | 10350.071456 | 0.189 | 3.4824161407 | 5.0732772786 | 0.1043276003 | ... | 0.1763568833 | 0.2651199623 | 0.0071444204 | 0.0155043466 | 0.0010883202 | 0.0478519615 | 0.6487709918 | 0.0102759588 | 0.5142542169 | 0.409631829 |
2 rows × 82 columns
# Checking the states
df["State_Abbreviation"].unique()
array(['US', 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM',
'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
df[df["State_Abbreviation"] =="WY"].head(3)
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | State_Abbreviation | Name | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3170 | 56 | 0 | 56000 | WY | Wyoming | 7809.903503 | 0.115 | 2.698914 | 4.130766 | 0.090792 | ... | 0.179469 | 0.010394 | 0.028395 | 0.010935 | 0.001012 | 0.10554 | 0.833306 | 0.006424 | 0.48823 | 0.35242 |
| 3171 | 56 | 1 | 56001 | WY | Albany County | 5133.53187 | 0.11 | 2.90064 | 4.179786 | 0.085394 | ... | 0.129866 | 0.012949 | 0.013162 | 0.034567 | 0.001409 | 0.101627 | 0.821581 | 0.006262 | 0.47817 | 0.119397 |
| 3172 | 56 | 3 | 56003 | WY | Big Horn County | 9097.45733 | 0.123 | 2.998264 | 3.865339 | 0.069968 | ... | 0.217675 | 0.007479 | 0.018054 | 0.005416 | 0.000516 | 0.096114 | 0.867435 | 0.015205 | 0.491145 | 1.0 |
3 rows × 82 columns
The row where State_Abbreviation is US represents the country average. Within each state, the row whose Name is the state itself represents the state average.
County_FIPS_Code is 0 when the row refers to a whole state (or the whole country) rather than a single county.
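The layout described above can be turned into three boolean masks. This is a minimal sketch on a toy frame with illustrative rows; the variable names (`is_national`, `is_state`, `is_county`) are made up here and do not appear in the notebook.

```python
import pandas as pd

# Toy rows mirroring the layout described above.
toy = pd.DataFrame({
    "State_Abbreviation": ["US", "WY", "WY", "WY"],
    "County_FIPS_Code": [0, 0, 1, 3],
    "Name": ["United States", "Wyoming", "Albany County", "Big Horn County"],
})

# The national row is the only one labelled "US".
is_national = toy["State_Abbreviation"] == "US"
# State summary rows have county code 0 but are not the national row.
is_state = (toy["County_FIPS_Code"] == 0) & ~is_national
# Everything with a non-zero county code is an actual county.
is_county = toy["County_FIPS_Code"] != 0

national_rows = toy[is_national]
state_rows = toy[is_state]
county_rows = toy[is_county]

print(len(national_rows), len(state_rows), len(county_rows))
```

Separating the levels matters for correlations and models, because mixing state and national summaries with county rows counts the same population more than once.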
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3194 entries, 0 to 3193 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 State_FIPS_Code 3194 non-null object 1 County_FIPS_Code 3194 non-null object 2 5_digit_FIPS_Code 3194 non-null object 3 State_Abbreviation 3194 non-null object 4 Name 3194 non-null object 5 Premature_Death 3134 non-null object 6 Poor_or_Fair_Health 3192 non-null object 7 Poor_Physical_Health_Days 3192 non-null object 8 Poor_Mental_Health_Days 3192 non-null object 9 Low_Birthweight 3088 non-null object 10 Adult_Smoking 3192 non-null object 11 Adult_Obesity 3192 non-null object 12 Food_Environment_Index 3161 non-null object 13 Physical_Inactivity 3192 non-null object 14 Access_to_Exercise_Opportunities 3132 non-null object 15 Excessive_Drinking 3192 non-null object 16 Alcohol_Impaired_Driving_Deaths 3167 non-null object 17 Sexually_Transmitted_Infections 3071 non-null object 18 Teen_Births 3005 non-null object 19 Uninsured 3193 non-null object 20 Primary_Care_Physicians 3047 non-null object 21 Dentists 3108 non-null object 22 Mental_Health_Providers 2993 non-null object 23 Preventable_Hospital_Stays 3123 non-null object 24 Mammography_Screening 3173 non-null object 25 Flu_Vaccinations 3176 non-null object 26 High_School_Completion 3194 non-null object 27 Some_College 3194 non-null object 28 Unemployment 3193 non-null object 29 Children_in_Poverty 3193 non-null object 30 Income_Inequality 3187 non-null object 31 Children_in_Single_Parent_Households 3193 non-null object 32 Social_Associations 3194 non-null object 33 Injury_Deaths 3089 non-null object 34 Air_Pollution___Particulate_Matter 3167 non-null object 35 Drinking_Water_Violations 3149 non-null object 36 Severe_Housing_Problems 3194 non-null object 37 Driving_Alone_to_Work 3194 non-null object 38 Long_Commute___Driving_Alone 3194 non-null object 39 Life_Expectancy 3124 non-null object 40 Premature_Age_Adjusted_Mortality 3134 non-null object 41 
Frequent_Physical_Distress 3192 non-null object 42 Frequent_Mental_Distress 3192 non-null object 43 Diabetes_Prevalence 3192 non-null object 44 HIV_Prevalence 2735 non-null object 45 Food_Insecurity 3194 non-null object 46 Limited_Access_to_Healthy_Foods 3161 non-null object 47 Insufficient_Sleep 3192 non-null object 48 Uninsured_Adults 3193 non-null object 49 Uninsured_Children 3193 non-null object 50 Other_Primary_Care_Providers 3183 non-null object 51 High_School_Graduation 2362 non-null object 52 Reading_Scores 2826 non-null object 53 Math_Scores 2739 non-null object 54 School_Segregation 2962 non-null object 55 School_Funding_Adequacy 3133 non-null object 56 Gender_Pay_Gap 3187 non-null object 57 Median_Household_Income 3192 non-null object 58 Children_Eligible_for_Free_or_Reduced_Price_Lunch 2606 non-null object 59 Child_Care_Cost_Burden 3192 non-null object 60 Child_Care_Centers 3044 non-null object 61 Suicides 2485 non-null object 62 Firearm_Fatalities 2323 non-null object 63 Motor_Vehicle_Crash_Deaths 2743 non-null object 64 Voter_Turnout 3164 non-null object 65 Census_Participation 3142 non-null object 66 Traffic_Volume 3041 non-null object 67 Homeownership 3194 non-null object 68 Severe_Housing_Cost_Burden 3189 non-null object 69 Broadband_Access 3194 non-null object 70 Population 3194 non-null object 71 percent_Below_18_Years_of_Age 3194 non-null object 72 percent_65_and_Older 3194 non-null object 73 percent_Non_Hispanic_Black 3194 non-null object 74 percent_American_Indian_or_Alaska_Native 3194 non-null object 75 percent_Asian 3194 non-null object 76 percent_Native_Hawaiian_or_Other_Pacific_Islander 3194 non-null object 77 percent_Hispanic 3194 non-null object 78 percent_Non_Hispanic_White 3194 non-null object 79 percent_Not_Proficient_in_English 3194 non-null object 80 percent_Female 3194 non-null object 81 percent_Rural 3187 non-null object dtypes: object(82) memory usage: 2.0+ MB
print(df.head(2).T.to_string())
0 1 State_FIPS_Code 00 01 County_FIPS_Code 000 000 5_digit_FIPS_Code 00000 01000 State_Abbreviation US AL Name United States Alabama Premature_Death 7281.9355638 10350.071456 Poor_or_Fair_Health 0.12 0.189 Poor_Physical_Health_Days 3 3.4824161407 Poor_Mental_Health_Days 4.4 5.0732772786 Low_Birthweight 0.0819065527 0.1043276003 Adult_Smoking 0.16 0.195 Adult_Obesity 0.32 0.393 Food_Environment_Index 7 5.3 Physical_Inactivity 0.22 0.278 Access_to_Exercise_Opportunities 0.8423863046 0.6092667226 Excessive_Drinking 0.19 0.1614162693 Alcohol_Impaired_Driving_Deaths 0.2655507901 0.258869637 Sexually_Transmitted_Infections 481.3 552.2 Teen_Births 19.300572586 27.598889304 Uninsured 0.1044496729 0.1182271569 Primary_Care_Physicians 0.0007637606 0.0006579252 Dentists 0.0007246807 0.0004869166 Mental_Health_Providers 0.0029570126 0.0012541973 Preventable_Hospital_Stays 2809 3599 Mammography_Screening 0.37 0.36 Flu_Vaccinations 0.51 0.44 High_School_Completion 0.8887404032 0.8740270016 Some_College 0.6725325979 0.6150082742 Unemployment 0.0535291312 0.0343902829 Children_in_Poverty 0.169 0.227 Income_Inequality 4.8913749294 5.1766763312 Children_in_Single_Parent_Households 0.2512967212 0.3090921916 Social_Associations 9.1296963648 11.910925297 Injury_Deaths 75.899512272 86.9057184 Air_Pollution___Particulate_Matter 7.4 9.3 Drinking_Water_Violations NaN 0.1343283582 Severe_Housing_Problems 0.1696721824 0.1315678879 Driving_Alone_to_Work 0.732358592 0.8378249329 Long_Commute___Driving_Alone 0.365 0.35 Life_Expectancy 78.528894654 74.83594896 Premature_Age_Adjusted_Mortality 358.7460227 499.86855039 Frequent_Physical_Distress 0.09 0.1107739678 Frequent_Mental_Distress 0.14 0.1648429623 Diabetes_Prevalence 0.09 0.13 HIV_Prevalence 379.7 341.6 Food_Insecurity 0.118 0.145 Limited_Access_to_Healthy_Foods 0.0610019647 0.0876054853 Insufficient_Sleep 0.33 0.3924300962 Uninsured_Adults 0.123766561 0.1491000099 Uninsured_Children 0.0539542665 0.0362680404 Other_Primary_Care_Providers 
0.0012318702 0.0010861376 High_School_Graduation 0.87 0.9071081634 Reading_Scores 3.0534 2.885602535 Math_Scores 3.003 2.72218766 School_Segregation 0.2454 0.2817412656 School_Funding_Adequacy 1062 -3868.511 Gender_Pay_Gap 0.8100444614 0.7418970988 Median_Household_Income 69717 53990 Children_Eligible_for_Free_or_Reduced_Price_Lunch 0.5308547682 0.53338294 Child_Care_Cost_Burden 0.2659357065 0.2722218184 Child_Care_Centers 6.8638668282 5.5092316855 Suicides 13.818282988 16.200669652 Firearm_Fatalities 12.430330228 22.293899524 Motor_Vehicle_Crash_Deaths 11.591311264 20.205514853 Voter_Turnout 0.6790952146 0.6263600041 Census_Participation 0.652 NaN Traffic_Volume 505.31 213.69282656 Homeownership 0.646331101 0.6939478703 Severe_Housing_Cost_Burden 0.1427574897 0.1194424811 Broadband_Access 0.8700069587 0.8204571454 Population 331893745 5039877 percent_Below_18_Years_of_Age 0.2216565817 0.2226744819 percent_65_and_Older 0.1682705801 0.1763568833 percent_Non_Hispanic_Black 0.1261202919 0.2651199623 percent_American_Indian_or_Alaska_Native 0.0131594526 0.0071444204 percent_Asian 0.0613162595 0.0155043466 percent_Native_Hawaiian_or_Other_Pacific_Islander 0.0026003593 0.0010883202 percent_Hispanic 0.1887563262 0.0478519615 percent_Non_Hispanic_White 0.5930615866 0.6487709918 percent_Not_Proficient_in_English 0.0410440385 0.0102759588 percent_Female 0.5047067187 0.5142542169 percent_Rural 0.193 0.409631829
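Every column was read as object (string) because the first row held variable codes rather than data. Now that this row is gone, we can convert all columns except the two text columns to numeric types.
# Make sure missing values are represented as np.nan
df.fillna(np.nan, inplace=True)
# Columns to convert to numeric: everything except State_Abbreviation and Name
# Note: the FIPS code columns become integers and lose their leading zeros
to_float = [col for col in list(df.columns) if col not in list(df.columns[3:5])]
df[to_float] = df[to_float].apply(pd.to_numeric)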
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3194 entries, 0 to 3193 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 State_FIPS_Code 3194 non-null int64 1 County_FIPS_Code 3194 non-null int64 2 5_digit_FIPS_Code 3194 non-null int64 3 State_Abbreviation 3194 non-null object 4 Name 3194 non-null object 5 Premature_Death 3134 non-null float64 6 Poor_or_Fair_Health 3192 non-null float64 7 Poor_Physical_Health_Days 3192 non-null float64 8 Poor_Mental_Health_Days 3192 non-null float64 9 Low_Birthweight 3088 non-null float64 10 Adult_Smoking 3192 non-null float64 11 Adult_Obesity 3192 non-null float64 12 Food_Environment_Index 3161 non-null float64 13 Physical_Inactivity 3192 non-null float64 14 Access_to_Exercise_Opportunities 3132 non-null float64 15 Excessive_Drinking 3192 non-null float64 16 Alcohol_Impaired_Driving_Deaths 3167 non-null float64 17 Sexually_Transmitted_Infections 3071 non-null float64 18 Teen_Births 3005 non-null float64 19 Uninsured 3193 non-null float64 20 Primary_Care_Physicians 3047 non-null float64 21 Dentists 3108 non-null float64 22 Mental_Health_Providers 2993 non-null float64 23 Preventable_Hospital_Stays 3123 non-null float64 24 Mammography_Screening 3173 non-null float64 25 Flu_Vaccinations 3176 non-null float64 26 High_School_Completion 3194 non-null float64 27 Some_College 3194 non-null float64 28 Unemployment 3193 non-null float64 29 Children_in_Poverty 3193 non-null float64 30 Income_Inequality 3187 non-null float64 31 Children_in_Single_Parent_Households 3193 non-null float64 32 Social_Associations 3194 non-null float64 33 Injury_Deaths 3089 non-null float64 34 Air_Pollution___Particulate_Matter 3167 non-null float64 35 Drinking_Water_Violations 3149 non-null float64 36 Severe_Housing_Problems 3194 non-null float64 37 Driving_Alone_to_Work 3194 non-null float64 38 Long_Commute___Driving_Alone 3194 non-null float64 39 Life_Expectancy 3124 non-null float64 40 
Premature_Age_Adjusted_Mortality 3134 non-null float64 41 Frequent_Physical_Distress 3192 non-null float64 42 Frequent_Mental_Distress 3192 non-null float64 43 Diabetes_Prevalence 3192 non-null float64 44 HIV_Prevalence 2735 non-null float64 45 Food_Insecurity 3194 non-null float64 46 Limited_Access_to_Healthy_Foods 3161 non-null float64 47 Insufficient_Sleep 3192 non-null float64 48 Uninsured_Adults 3193 non-null float64 49 Uninsured_Children 3193 non-null float64 50 Other_Primary_Care_Providers 3183 non-null float64 51 High_School_Graduation 2362 non-null float64 52 Reading_Scores 2826 non-null float64 53 Math_Scores 2739 non-null float64 54 School_Segregation 2962 non-null float64 55 School_Funding_Adequacy 3133 non-null float64 56 Gender_Pay_Gap 3187 non-null float64 57 Median_Household_Income 3192 non-null float64 58 Children_Eligible_for_Free_or_Reduced_Price_Lunch 2606 non-null float64 59 Child_Care_Cost_Burden 3192 non-null float64 60 Child_Care_Centers 3044 non-null float64 61 Suicides 2485 non-null float64 62 Firearm_Fatalities 2323 non-null float64 63 Motor_Vehicle_Crash_Deaths 2743 non-null float64 64 Voter_Turnout 3164 non-null float64 65 Census_Participation 3142 non-null float64 66 Traffic_Volume 3041 non-null float64 67 Homeownership 3194 non-null float64 68 Severe_Housing_Cost_Burden 3189 non-null float64 69 Broadband_Access 3194 non-null float64 70 Population 3194 non-null int64 71 percent_Below_18_Years_of_Age 3194 non-null float64 72 percent_65_and_Older 3194 non-null float64 73 percent_Non_Hispanic_Black 3194 non-null float64 74 percent_American_Indian_or_Alaska_Native 3194 non-null float64 75 percent_Asian 3194 non-null float64 76 percent_Native_Hawaiian_or_Other_Pacific_Islander 3194 non-null float64 77 percent_Hispanic 3194 non-null float64 78 percent_Non_Hispanic_White 3194 non-null float64 79 percent_Not_Proficient_in_English 3194 non-null float64 80 percent_Female 3194 non-null float64 81 percent_Rural 3187 non-null float64 dtypes: 
float64(76), int64(4), object(2) memory usage: 2.0+ MB
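The conversion above turns the FIPS codes into integers, which drops their leading zeros. If zero-padded codes are needed later (for example, as join keys), one option is to convert only the measurement columns. The sketch below assumes a toy frame with made-up values; `id_cols` and `value_cols` are hypothetical names, and `errors="coerce"` is an optional choice that turns unparseable entries into NaN.

```python
import pandas as pd

# Toy frame where every value is a string, as in the raw CSV (values made up).
toy = pd.DataFrame({
    "State_FIPS_Code": ["01", "01"],
    "County_FIPS_Code": ["000", "001"],
    "Name": ["Alabama", "Autauga County"],
    "Adult_Obesity": ["0.393", "not available"],
})

# Columns that are identifiers or labels rather than measurements.
id_cols = ["State_FIPS_Code", "County_FIPS_Code", "Name"]
value_cols = [c for c in toy.columns if c not in id_cols]

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy)
```

With this variant, filters such as `County_FIPS_Code == 0` would need to compare against the string `"000"` instead.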
df.describe()
| | State_FIPS_Code | County_FIPS_Code | 5_digit_FIPS_Code | Premature_Death | Poor_or_Fair_Health | Poor_Physical_Health_Days | Poor_Mental_Health_Days | Low_Birthweight | Adult_Smoking | Adult_Obesity | ... | percent_65_and_Older | percent_Non_Hispanic_Black | percent_American_Indian_or_Alaska_Native | percent_Asian | percent_Native_Hawaiian_or_Other_Pacific_Islander | percent_Hispanic | percent_Non_Hispanic_White | percent_Not_Proficient_in_English | percent_Female | percent_Rural |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3194.000000 | 3194.000000 | 3194.000000 | 3134.000000 | 3192.000000 | 3192.000000 | 3192.000000 | 3088.000000 | 3192.000000 | 3192.000000 | ... | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3194.000000 | 3187.000000 |
| mean | 30.249530 | 101.886662 | 30351.417032 | 8891.562734 | 0.159942 | 3.511726 | 4.794971 | 0.082138 | 0.199762 | 0.361428 | ... | 0.199929 | 0.090869 | 0.024611 | 0.016971 | 0.001625 | 0.102183 | 0.749862 | 0.016072 | 0.495715 | 0.580467 |
| std | 15.160981 | 107.624838 | 15179.045587 | 2929.948857 | 0.044333 | 0.652486 | 0.628114 | 0.020293 | 0.041210 | 0.046825 | ... | 0.047879 | 0.141564 | 0.077649 | 0.030939 | 0.009667 | 0.139670 | 0.202763 | 0.026852 | 0.023189 | 0.315553 |
| min | 0.000000 | 0.000000 | 0.000000 | 3090.426825 | 0.065000 | 1.849017 | 2.779181 | 0.028871 | 0.067000 | 0.176000 | ... | 0.050729 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006827 | 0.026802 | 0.000000 | 0.245614 | 0.000000 |
| 25% | 18.000000 | 33.000000 | 18171.500000 | 6868.647904 | 0.125000 | 3.027309 | 4.373272 | 0.068281 | 0.174000 | 0.336000 | ... | 0.169189 | 0.008182 | 0.004311 | 0.005208 | 0.000377 | 0.026999 | 0.630136 | 0.002579 | 0.490583 | 0.325275 |
| 50% | 29.000000 | 77.000000 | 29174.000000 | 8538.518058 | 0.152000 | 3.448386 | 4.813037 | 0.079532 | 0.198000 | 0.366000 | ... | 0.195519 | 0.024266 | 0.007193 | 0.008099 | 0.000721 | 0.048823 | 0.821402 | 0.007069 | 0.499580 | 0.588250 |
| 75% | 45.000000 | 133.000000 | 45074.500000 | 10494.403953 | 0.189000 | 3.946273 | 5.221064 | 0.091418 | 0.226000 | 0.391000 | ... | 0.225286 | 0.104902 | 0.014716 | 0.015733 | 0.001357 | 0.108941 | 0.915241 | 0.017709 | 0.507127 | 0.861214 |
| max | 56.000000 | 840.000000 | 56045.000000 | 30007.870277 | 0.368000 | 6.335031 | 6.945581 | 0.216981 | 0.411000 | 0.532000 | ... | 0.581710 | 0.856197 | 0.922567 | 0.420553 | 0.475610 | 0.962604 | 0.975921 | 0.384369 | 0.570535 | 1.000000 |
8 rows × 80 columns
Relationship between insufficient sleep and adult obesity in Louisiana (LA) and California (CA)
x = "Adult_Obesity"
y = "Insufficient_Sleep"
z = "State_Abbreviation"
not_null_mask = df[[x,y,z]].notnull().all(axis=1)
not_null_rows = df[[x,y,z]][not_null_mask]
not_null_rows = not_null_rows.query('State_Abbreviation== "LA" or State_Abbreviation== "CA"')
sns.scatterplot(data=not_null_rows, x = x, y = y, hue = z)
<Axes: xlabel='Adult_Obesity', ylabel='Insufficient_Sleep'>
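A scatter plot shows the shape of the relationship; a per-state correlation coefficient puts a number on it. The following is a small sketch on made-up values, deliberately constructed so that one state has a perfectly positive relationship and the other a perfectly negative one. It does not reflect the real LA or CA data.

```python
import pandas as pd

# Toy county rows (made-up values) for two states.
toy = pd.DataFrame({
    "State_Abbreviation": ["LA", "LA", "LA", "CA", "CA", "CA"],
    "Adult_Obesity":      [0.30, 0.35, 0.40, 0.20, 0.25, 0.30],
    "Insufficient_Sleep": [0.33, 0.36, 0.39, 0.36, 0.33, 0.30],
})

# Pearson correlation between the two measures, computed separately per state.
per_state = {
    state: group["Adult_Obesity"].corr(group["Insufficient_Sleep"])
    for state, group in toy.groupby("State_Abbreviation")
}
print(per_state)
```

On the real data, the state-level summary rows (County_FIPS_Code equal to 0) should probably be excluded first so that only counties enter the calculation.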
sns.scatterplot(data=df, x = "Broadband_Access", y = "Math_Scores")
<Axes: xlabel='Broadband_Access', ylabel='Math_Scores'>
Splitting the columns into health factors (variables) and health outcomes (targets)
target_cols = ['Premature_Death', 'Life_Expectancy', 'Premature_Age_Adjusted_Mortality',
'Poor_or_Fair_Health','Poor_Physical_Health_Days', 'Poor_Mental_Health_Days','Low_Birthweight',
'Frequent_Physical_Distress','Frequent_Mental_Distress', 'Diabetes_Prevalence', 'HIV_Prevalence']
variable_cols = [x for x in df.columns[5:] if x not in target_cols]
df_corr = df.iloc[:,5:].corr()
df_corr.shape
(77, 77)
df_corr = df_corr[variable_cols]
df_corr = df_corr.loc[target_cols]
df_corr.shape
(11, 66)
# Set the figure size before plotting so that it applies to this heatmap
plt.figure(figsize=(10, 15))
sns.heatmap(df_corr.T, annot=True, annot_kws={"fontsize": 7})
plt.xticks(fontsize=8)
plt.yticks(fontsize=9)
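Finding the features that obesity is most correlated with
# Sort all features by their correlation with Adult_Obesity (most negative first)
obesity_corr = list(df.iloc[:, 5:].corr()[["Adult_Obesity"]].sort_values(by="Adult_Obesity").index)
# Keep the 5 most negative and the 6 most positive, skipping Adult_Obesity itself (last, correlation 1)
obesity_corr = obesity_corr[:5] + obesity_corr[-7:-1]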
obesity_corr
['Life_Expectancy', 'Median_Household_Income', 'Some_College', 'Voter_Turnout', 'Broadband_Access', 'Premature_Age_Adjusted_Mortality', 'Frequent_Physical_Distress', 'Adult_Smoking', 'Poor_or_Fair_Health', 'Diabetes_Prevalence', 'Physical_Inactivity']
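Another way to rank features is by the magnitude of the correlation, so that strong negative and strong positive links appear together at the top. The sketch below uses synthetic data generated on the spot; the column names `inactivity`, `income`, and `noise` are illustrative stand-ins, not values from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: "inactivity" tracks the target, "income" opposes it,
# and "noise" is unrelated. None of these are real county values.
target = rng.normal(size=n)
toy = pd.DataFrame({
    "obesity": target,
    "inactivity": target + 0.3 * rng.normal(size=n),
    "income": -target + 1.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# Pearson correlation of every other column with the target column.
corr = toy.drop(columns="obesity").corrwith(toy["obesity"])

# Sort by magnitude so strong negative and positive links both rise to the top.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign is preserved in `ranked`, so the direction of each relationship is still visible after sorting by absolute value.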
A few more plots
# Create each figure with the intended size before drawing on it
plt.figure(figsize=(6, 6))
sns.scatterplot(data=df, x="Median_Household_Income", y="Adult_Obesity")
plt.figure(figsize=(6, 6))
sns.scatterplot(data=df, x="Adult_Smoking", y="Adult_Obesity")
# Rows with County_FIPS_Code == 0 are the state-level (and national) summaries
state_df = df[pd.to_numeric(df["County_FIPS_Code"]) == 0]
# Set the figure size before plotting so that it applies to this bar chart
plt.figure(figsize=(10, 9))
sns.barplot(state_df.sort_values(by=["Adult_Obesity"]), x="Adult_Obesity", y="State_Abbreviation")
plt.ylabel("State Names")
plt.xlabel("Adult obesity")
plt.yticks(fontsize=8)
plt.title("Obesity rates among adults in different US states", {'fontsize': 20})
I plan to explore the datasets to understand why some states or counties do well in health outcomes and why others do not. Other questions include: "What factors influence health outcomes the most?", "What affects obesity the most?", "Does the state/county location matter for health outcomes?", "Why do certain demographics correlate with health results?", and so on.
Besides, it would be interesting to explore how political preferences affect the health status of an area. Directly or indirectly, they are likely to influence policies, which in turn influence the general public's health behaviors.
Similarly, social vulnerability might be associated with health outcomes too, so I plan to explore it further.
Hopefully, I can find more data and variables to merge with this dataset, and with better data analysis, I can figure out which variables to include in a model. The model will be used to predict a health outcome, such as mortality or obesity, from easily available data.
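As a rough idea of what a first baseline model could look like, here is a minimal sketch of ordinary least squares with a held-out test split. Everything in it is an assumption for illustration: the data is synthetic, the two features merely stand in for factors such as smoking or inactivity rates, and `true_coef` is a made-up ground truth. It is not the model this project will end up using.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic features standing in for, e.g., smoking and inactivity rates.
X = rng.normal(size=(n, 2))
true_coef = np.array([2.0, -1.0])  # made-up ground truth
y = 0.5 + X @ true_coef + 0.1 * rng.normal(size=n)

# Hold out the last 20% of rows to check predictions on unseen data.
split = int(0.8 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Add an intercept column and solve ordinary least squares.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Predict on the held-out rows and measure the root mean squared error.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
pred = A_test @ coef
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))

print("intercept and slopes:", np.round(coef, 2))
print("test RMSE:", round(rmse, 3))
```

On the real data, rows with missing values in the chosen columns would have to be dropped or imputed first, and strongly correlated features would make individual coefficients hard to interpret.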